Abstract


Computer vision applications that work with videos often
require that the foreground (the region of interest) be clearly
segmented from the uninteresting background. To address
this  problem,  we  present  a  general  framework  for  scene 
modelling  and  robust  foreground  detection  that  works  un¬ 
der  difficult  conditions  such  as  moving  camera  and  dynamic 
background.  This  is  achieved  by  first  representing  the  scene 
as  a  union  of  pixel  layers,  and  then  propagating  these  layers 
through  the  video  by  a  maximum-likelihood  (ML)  assign¬ 
ment  of  pixels  to  the  different  layers.  The  possibility  of  a 
pixel  not  belonging  to  any  of  the  layers  in  the  scene  is  also 
one  of  the  hypotheses  that  are  automatically  tested  during 
the  maximum-likelihood  assignment. 

The  proposed  approach  has  a  number  of  salient  virtues. 
Firstly,  the  clustering  and  layering  is  automatic,  while  the 
feature-space  can  be  user  defined  to  suit  the  application. 
Secondly,  the  cluster  propagation  step  implicitly  performs 
layer  tracking  along  with  foreground  detection.  Standard 
pixel  based  scene  modelling  techniques  become  a  particular 
case  of  our  general  framework,  when  all  pixels  in  the  scene 
are  independent  and  distinct  from  each  other  and  belong  to 
separate  clusters.  It  is  observed  that  pixels  belonging  to  the 
same  clusters  in  the  feature  space  usually  map  to  spatially 
connected  layers  in  the  image  space,  leading  us  to  consider 
that  useful  correlation  exists  between  features  of  pixels  in 
the spatial vicinity. This makes it possible to deal with camera motion
with no registration, or only nominal registration. We illustrate our ideas
with  a  number  of  interesting  and  difficult  real-life  examples. 


1.  Introduction  And  Previous  Work 

Robustly  segmenting  foreground  is  an  important  require¬ 
ment  for  many  computer  vision  algorithms  including  track¬ 
ing,  identification,  and  surveillance.  Although  there  is  often 
no  prior  information  available  about  the  foreground  object 
to  be  segmented,  in  most  situations  the  background  scene 
is  available  in  all  frames  of  the  video,  and  hence  can  be 
learned  or  modelled.  This  allows  for  segmenting  the  fore¬ 
ground  by  “subtracting”  the  background  from  the  scene,  in¬ 
stead  of  explicitly  modelling  the  foreground. 

Background  subtraction  by  simple  frame  differences  as 
in [6] has not been very successful in most real-life situations.
Natural scenes often consist of dynamic elements
like  ripples  in  water,  trees  swaying  with  wind,  escalators 
at  airports,  and  moving  crowds,  making  background  mod¬ 
elling  in  video  a  very  challenging  problem.  In  [4],  a  sta¬ 
tistical  modelling  of  the  background  is  performed  using  the 
assumption  that  the  imaging  sensor  (camera)  is  completely 
stationary,  to  simplify  the  problem  (although  the  problem 
still  remains  challenging).  Camera  stationarity  is  a  tight 
constraint,  and  in  many  practical  scenarios,  the  assumption 
of  a  stationary  sensor  is  violated,  as  for  example  in  the  case 
of camera shake due to wind, support vibrations, hand instability
(in the case of hand-held cameras), panning or tilting
(in the case of a PTZ camera), and general camera motion induced
by mounting cameras on flying drones. Wada et al. [15]
have proposed a method for handling pan-tilt-zoom
camera motion by using an appearance sphere. Planar camera
motion can also be handled by precise registration of
the frames to create a background image, which is then subtracted
from the scene to segment the foreground. This
type of registration is very sensitive to noise, and in most
natural cases it is not very precise (and is also time consuming).
Thus, scene modelling must be robust enough to accommodate
pixel position uncertainty, and minimize or completely
avoid pixel registration. In our proposed framework, the
spatio-temporal correlation between pixels is exploited to
provide priors on the set of pixel clusters or layers that any
pixel can belong to. This, in turn, allows us to handle camera
motion without any explicit registration, while robustly detecting
the foreground, as illustrated by our results in Section 4.

Background  modelling  techniques  are  categorized 
mainly  as  pixel  based  and  region  based  methods.  The  ma¬ 
jority  of  background  modelling  methods  in  current  litera¬ 
ture  are  of  the  first  type.  The  initial  approaches  in  this  regard 
(most notably [16]) assumed that the pdf of a pixel at location
$(x, y)$ can be modelled by a single 3-D Gaussian distribution
$N(\mu(x, y), \Sigma(x, y))$. The mean ($\mu$) and variance ($\Sigma$)
are  estimated  at  each  pixel  position  over  a  set  of  images,  and 
then  the  likelihood  of  a  pixel  belonging  to  the  background 
can  be  computed,  in  order  to  assign  it  to  the  background  or 
foreground.  Finding  a  single  Gaussian  unsuitable  to  model 
a color pixel, in [5, 12], and more recently in [8], each pixel
is  modelled  as  a  mixture  of  a  pre-determined  number  of 
Gaussians.  This  helps  to  address  the  different  “modes”  in 
the  behavior  of  each  pixel.  The  problem  of  dynamic  back¬ 
ground  has  been  recently  addressed  by  Mittal  and  Paragios 
[7], where the authors propose an adaptive Kernel Density
Estimation  technique  that  uses  optical  flow  as  well  as  color 
to  generate  a  5-D  pdf  for  each  pixel.  This  technique  works 
well  with  a  stationary  camera  constraint,  but  is  difficult  to 
generalize  to  a  moving  camera  scenario.  We  provide  a  more 
general  framework  which  exploits  spatial  correlation  among 
pixels to handle camera motion. Critical thresholds in [7]
are  automatically  handled  in  our  work  using  an  a-contrario 
model. 
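
For reference, the single-Gaussian pixel model mentioned above can be sketched in a few lines. This is only an illustration of the classical per-pixel idea, not the method proposed in this paper. It assumes a scalar intensity per pixel instead of a 3-D color vector, and the function names and the decision factor `k` are hypothetical choices made for the example.

```python
import statistics


def fit_pixel_gaussian(history):
    """Mean and standard deviation of one pixel's intensity over training frames."""
    mu = statistics.fmean(history)
    # Guard against a zero standard deviation for perfectly constant pixels.
    sigma = max(statistics.pstdev(history), 1e-6)
    return mu, sigma


def is_foreground(value, mu, sigma, k=2.5):
    """Flag the pixel when it lies more than k standard deviations from the mean."""
    return abs(value - mu) > k * sigma


# Intensities observed at one pixel location over five training frames.
mu, sigma = fit_pixel_gaussian([100, 102, 98, 101, 99])
small_change = is_foreground(101, mu, sigma)
large_change = is_foreground(140, mu, sigma)
```

Because every pixel location keeps its own model, this scheme has no notion of spatial neighbourhood, which is why it degrades as soon as the camera moves.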

In  region  based  methods,  the  background  is  modelled  as 
a  group  of  regions.  The  authors  in  [  13  ]  describe  a  three  level 
algorithm  where  region-level  and  frame-level  information  is 
used  to  make  decisions  at  the  pixel-level.  In  [19],  Zhong  et 
al.  have  used  a  Kalman  filter  for  modelling  image  regions 
as an autoregressive moving average (ARMA) process. Recently,
Sheikh and Shah [10] have proposed exploiting spatial
correlation among pixels by using the position of a pixel
along  with  the  color  to  generate  a  single  5-D  pdf  that  models 
the  entire  image. 

In  this  paper,  we  present  a  more  generalized  and  si¬ 
multaneously  simpler  method  for  image  layering  or  pixel 
cluster  based  background  modelling.  The  background  is 
first  segmented  into  “easy  to  model”  clusters  of  pixels. 
Subsequently,  pixels  of  incoming  frames  are  either  as¬ 
signed  to  one  of  the  existing  clusters  (layers)  based  on 
maximum-likelihood,  or  categorized  as  “outliers”  via  au¬ 
tomatic  threshold  computation.  This  classification  helps 
to propagate clusters in downstream image frames while
achieving a finer and more correct definition of layers. The
following  sections  describe  our  framework  in  more  detail. 

1.1.  Algorithm  Overview 

Offline Step

Training Step:

(LILO stack of first M frames)

■ Decompose scene into layers, $S = \{L_i\}$, $i = 1, \ldots, N$

■ Compute bandwidth $H_i$ corresponding to pixels in $L_i$

■ Compute threshold $\tau_i$ for $L_i$, such that “Number of False Alarms” < 1

Online Step (repeated for all frames > M)

Layer Propagation & Outlier Detection:

(Current frame is $F_t$, $t > M$)

■ Assign pixels x in $F_t$ to one of the $L_i$ using maximum-likelihood (ML)

■ $L_0$ is the layer of outliers

Update Step:

■ Update training frames ($F_t$ is added at the end of the training stack)


Figure 1. Overview of the proposed modelling and foreground detection
framework.

Figure  1  gives  a  brief  overview  of  our  algorithm. 
The  first  few  frames  (say  M)  of  a  video  sequence  form 
the  “Last-In-Last-Out”  (LILO)  stack  of  training  frames. 
Frames in this stack are layered into N layers,¹ $\{L_i\}_{i=1}^{N}$,
using a Sampling-Expectation (SE) algorithm proposed in
[18]. We simultaneously and automatically compute the
bandwidths ($H_i$'s) and thresholds ($\tau_i$'s) corresponding to
each layer ($L_i$), which are required for the ML-based pixel
assignment (refer to Section 3).

In  the  layer  propagation  step,  each  pixel  is  assigned  to 
one  of  the  layers  using  maximum-likelihood.  The  possi¬ 
ble  layers  that  each  pixel  can  belong  to  are  short-listed  by 
observing  the  layer  labels  from  the  training  stack  in  the 
spatio-temporal  vicinity  of  the  pixel.  The  layer  L0  (initially 
empty)  is  the  layer  of  outliers.  Outliers  are  detected  using 
the thresholds ($\tau_i$'s) automatically computed during the training
step (see Section 3.2). Once all the pixels are assigned
to  a  layer,  the  frame  is  added  (“pushed”)  at  the  end  of  the 
LILO  stack  and  the  oldest  frame  in  the  stack  is  released 
(“popped”). 
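
The push and pop behaviour of the training stack can be sketched with a bounded double-ended queue. This is a minimal illustration, not the authors' C++ implementation; the class name `TrainingStack` and its methods are hypothetical.

```python
from collections import deque


class TrainingStack:
    """Fixed-size stack of labelled frames: newest frame in, oldest frame out."""

    def __init__(self, max_frames):
        # A deque with maxlen drops its oldest item automatically on append.
        self._frames = deque(maxlen=max_frames)

    def push(self, labelled_frame):
        """Add a frame at the end; return the released frame, or None."""
        released = None
        if len(self._frames) == self._frames.maxlen:
            released = self._frames[0]
        self._frames.append(labelled_frame)
        return released

    def frames(self):
        return list(self._frames)


# With M = 3, the fourth frame pushes the first one out of the stack.
stack = TrainingStack(max_frames=3)
released = [stack.push(name) for name in ["F1", "F2", "F3", "F4"]]
```

In a full pipeline each stored item would hold both the pixel features of a frame and the layer labels assigned to them, since the propagation step needs both.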

¹The words “layering” and “clustering” have been used interchangeably,
as it is often observed that clusters in the feature space correspond
to connected layers in the image domain.


Section 2 and Section 3 describe in detail the main components
of the algorithm. We then present
some interesting and challenging results in Section 4. A discussion
of the limitations of the proposed framework and of future
work is given in Section 5.

2.  Automatic  Layering 

In  our  framework,  the  scene  is  modelled  as  a  group  of 
pixel  clusters  or  layers.  This  is  done  in  order  to  exploit  re¬ 
dundancy  in  the  features  of  pixels  belonging  to  the  same 
cluster.  Here,  we  have  considered  only  the  colors  of  a  pixel 
as  the  features,  but  the  feature  space  can  be  completely  user 
defined (e.g., it can include optical flow, leading to a 5-D vector;
see the discussion in Section 5). This section describes the
method that we use to automatically cluster the pixels. It is
observed that such clusters in fact correspond to connected
layers in the image domain.² For a review of color image
layering techniques in the literature, please refer to [2].

2.1.  Initial  Guess 

We first compute the color C corresponding to a local
maximum ($h_{max}$) of the histogram of the image. All pixels
with colors lying inside a particular radius ($\rho$) of C form our
candidate layer ($L_c$). The histogram maximum and radius
are computed as in [3].
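
A rough sketch of this initial guess is given below. It is an illustration under simplifying assumptions rather than the procedure of [3]: the histogram is a coarse 3-D color histogram with a hypothetical `bin_size`, its global maximum is used as the peak, and the radius is passed in as a fixed value instead of being estimated from the data. The function name `initial_guess` is also hypothetical.

```python
from collections import Counter
import math


def initial_guess(pixels, radius, bin_size=16, excluded=frozenset()):
    """Return (peak_color, candidate_indices) for one candidate layer.

    pixels   : list of (r, g, b) tuples
    radius   : color-space radius around the histogram peak (assumed given)
    excluded : indices of pixels already assigned to earlier layers
    """
    # Coarse 3-D color histogram over the pixels not yet assigned.
    hist = Counter(
        tuple(c // bin_size for c in p)
        for i, p in enumerate(pixels) if i not in excluded
    )
    if not hist:
        return None, []
    peak_bin, _ = max(hist.items(), key=lambda kv: kv[1])
    # Use the centre of the most populated bin as the peak color C.
    peak = tuple(b * bin_size + bin_size / 2 for b in peak_bin)
    candidates = [
        i for i, p in enumerate(pixels)
        if i not in excluded and math.dist(p, peak) <= radius
    ]
    return peak, candidates


# Eight reddish pixels and four bluish pixels: the reddish mode wins.
pixels = [(200, 10, 10)] * 5 + [(205, 12, 8)] * 3 + [(20, 20, 220)] * 4
peak, layer = initial_guess(pixels, radius=20)
```

Passing the indices of already extracted layers through `excluded` mirrors the way later candidate layers are drawn only from the remaining pixels.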

2.2.  Refinement  Step 

Once  an  initial  guess  for  a  layer  is  obtained,  it  is  im¬ 
portant  to  add  or  remove  pixels  from  the  layer  depending 
on  consistency  of  features,  in  order  to  improve  the  homo¬ 
geneity  and  integrity  of  the  layer.  This  is  done  by  a  re¬ 
finement  step  that  uses  the  Sampling-Expectation  (SE)  tech¬ 
nique  proposed  in  [  18  ].  There  are  3  main  steps  in  this  refin¬ 
ing  process: 

• It is assumed that the pixels in the candidate layer and
the rest of the image come from two separate processes.
To initialize these processes, start with an initial
distribution $P_{L_c}$ on the image pixels, with pixels
belonging to $L_c$ having high values and pixels not in
$L_c$ getting low values (with a gradual spatial decay,
for example a Gaussian distribution). This probability
$P_{L_c}$ indicates our confidence that
the pixel belongs to $L_c$. $P_{bg}$ forms the complementary
background process.

• S-step: The image is uniformly sampled to get a set
of samples $S = \{x_i\}_{i=1}^{m}$. Generally, a sample size of
about 10 to 20 percent of the pixels in the image has
been found to be satisfactory.

²Connectivity can be explicitly enforced, e.g., by adding more dimensions
to the feature vector that indicate (say) the median color of the
neighborhood.


• E-step: Pixels are assigned to or removed from $L_c$
based on maximum-likelihood, i.e., if $P_{L_c} > P_{bg}$, the
pixel is assigned to $L_c$; otherwise it is removed.

The  S  and  E  steps  are  iterated  until  the  composition  of 
Lc  becomes  stable.  In  the  above  algorithm,  the  likelihood 
of  a  pixel  belonging  to  one  of  the  two  processes  is  computed 
using  a  weighted  Kernel  Density  Estimation.  For  details 
about  the  concept  of  Kernel  Density  Estimation  the  reader  is 
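referred to [11]. Given a pixel y belonging to the image, we
estimate the probabilities $P_{L_c}(y)$ and $P_{bg}(y)$ by first computing
the following parameters (as in [18]):

$$w_{L_c}(y) = \sum_{i=1}^{m} P_{L_c}(x_i) \prod_{j=1}^{d} K\left(\frac{y_j - x_{ij}}{h_j}\right) \qquad (1)$$

$$w_{bg}(y) = \sum_{i=1}^{m} P_{bg}(x_i) \prod_{j=1}^{d} K\left(\frac{y_j - x_{ij}}{h_j}\right) \qquad (2)$$
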


(a)  A  candidate  ( bottom-left )  layer  is  extracted  from  the  original 
image  (top)  and  refined  using  iterative  Sampling-Expectation  to 
get  the  final  layer  ( bottom-right ). 


(b)  Layers  extracted  from  the  original  image  in  (a). 
Figure  2.  The  automatic  layering  process. 


where K is the kernel or smoothing function (we use a
Gaussian kernel), d is the dimension of the feature space
(3 in our case, 5 if we incorporate optical flow), and the $h_j$'s
are the kernel bandwidths, which we estimate using
$h_j \approx 1.06\,\hat{\sigma}_j\,m^{-1/5}$ [9], where $\hat{\sigma}_j$ is the standard deviation estimated
over the sample S in dimension j. The pixel probabilities
are then computed as:

$$P_{L_c}(y) = \frac{w_{L_c}}{w_{L_c} + w_{bg}} \qquad (3)$$

$$P_{bg}(y) = \frac{w_{bg}}{w_{L_c} + w_{bg}} \qquad (4)$$
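
The refinement loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of [18]: each sample's kernel contribution is weighted by its current membership probability (and by the complement for the background process), the kernel is a product of 1-D Gaussians, and the function names, the iteration cap and the small bandwidth floor are hypothetical.

```python
import math
import random
import statistics


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def rule_of_thumb_bandwidths(samples):
    """Per-dimension bandwidth h_j = 1.06 * sigma_j * m**(-1/5)."""
    m = len(samples)
    hs = []
    for j in range(len(samples[0])):
        sigma = statistics.pstdev([s[j] for s in samples])
        hs.append(max(1.06 * sigma * m ** (-0.2), 1e-3))  # avoid zero width
    return hs


def layer_probability(y, samples, sample_probs, hs):
    """Normalised layer probability of y from weighted kernel sums."""
    w_layer = 0.0
    w_bg = 0.0
    for x, p in zip(samples, sample_probs):
        k = 1.0
        for yj, xj, hj in zip(y, x, hs):
            k *= gaussian_kernel((yj - xj) / hj)
        w_layer += p * k         # weight: confidence that x is in the layer
        w_bg += (1.0 - p) * k    # complementary background process
    total = w_layer + w_bg
    return 0.5 if total == 0.0 else w_layer / total


def refine(pixels, probs, sample_fraction=0.2, max_iters=20, seed=0):
    """Iterate the S-step and E-step until layer membership is stable."""
    rng = random.Random(seed)
    members = [p > 0.5 for p in probs]
    for _ in range(max_iters):
        m = max(2, int(sample_fraction * len(pixels)))
        idx = rng.sample(range(len(pixels)), m)            # S-step
        samples = [pixels[i] for i in idx]
        sample_probs = [probs[i] for i in idx]
        hs = rule_of_thumb_bandwidths(samples)
        probs = [layer_probability(y, samples, sample_probs, hs)
                 for y in pixels]                          # E-step
        new_members = [p > 0.5 for p in probs]
        if new_members == members:
            break
        members = new_members
    return members


# 20 reddish and 20 bluish pixels; the initial guess misses two reddish ones.
pixels = ([(200 + i % 3, 10, 10) for i in range(20)]
          + [(20, 20, 220 + i % 3) for i in range(20)])
initial = [0.9] * 18 + [0.1] * 22
members = refine(pixels, initial, sample_fraction=1.0)
```

The toy call uses every pixel as a sample so that the outcome does not depend on the random draw; with the 10 to 20 percent sampling described in the text, the result varies slightly from run to run.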

The “initial guess” step (with previously extracted layers excluded
when computing $h_{max}$) and the “refinement” step are
performed repeatedly, generating layers in the image domain,
until the initial guess $L_c$ contains fewer than 1% of the pixels
in the entire image. It is also possible that some pixels are
classified as belonging to multiple layers (as the SE refinement
is carried out over the entire image). These pixels,
along with the residual unassigned pixels, are assigned to
one of the layers using maximum-likelihood. This process
allows us to describe the scene $S$ as $S = \bigcup_{i=1}^{N} L_i$, where
the $L_i$ are the N extracted layers. Note that both the layers and
their number N are automatically computed.

Figure  2(a)  shows  the  initial  guess  and  the  refined  final 
layer.  Observe  that  the  refinement  step  ensures  consistency 
in  the  layer  and  more  accurately  defines  its  boundaries.  All 
layers  extracted  from  the  original  frame  are  shown  in  Figure 
2  (b).  These  layers  are  very  similar  to  how  a  human  observer 
would  segregate  the  scene.  Pixels  in  these  layers  belong  to 
the  same  cluster  in  the  feature  space  and  are  also  spatially 
connected  as  seen  in  the  image. 

The  initial  training  stack  (T)  used  for  training  the  back¬ 
ground  model  consists  of  the  first  M  frames  in  the  se¬ 
quence.  The  first  frame  is  layered  using  the  above  tech¬ 
nique  and  the  remaining  frames  in  T  are  layered  using  the 
layer  labels  in  the  previous  frame  as  a  starting  point  for  the 
refinement  step. 

3.  Layer  Propagation 

Once  the  initial  training  stack  T  is  layered,  the  rest  of 
the  background  modelling  process  is  to  assign  all  incoming 
pixels  to  one  of  the  predetermined  layers  in  the  scene,  or 
identify  it  as  an  outlier/foreground  (assign  to  layer  L0). 

3.1.  Density  Estimation 

Similar to the work in [10], we believe that there exists
meaningful correlation between pixels in the spatial vicinity.
To incorporate this correlation into our framework, we use
a parameter w, which indicates the registration uncertainty
or the spatial variance of a pixel. This user-defined parameter
determines the size ($w \times w \times M$, where M is
the number of frames in the stack T) of the spatio-temporal
neighborhood of pixels in the training stack T that may be
correlated to the current pixel. All pixels ($x_j$) from the stack
T assigned to layer $L_i$ which lie in the “w-vicinity” of the
currently examined pixel (y) form the sample set $S_i$. To compute
the probability that the pixel y belongs to any layer $L_i$,
we use a Non-Parametric Kernel Density Estimator with a
Gaussian kernel:

$$P_{L_i}(y) = \frac{1}{|S_i|} \sum_{x_j \in S_i} |H|^{-1/2} K\left(H^{-1/2}(y - x_j)\right) \qquad (5)$$

where $|S_i|$ is the number of samples ($x_j$'s) belonging to
Li.  The  bandwidth  matrix  H  is  assumed  to  be  diagonal, 
H (Li)  =  h(Li) I,  where  the  argument  Li  is  used  to  indi¬ 
cate  all  samples  belonging  to  Li,  i.e.,  we  use  the  same  band¬ 
width  for  samples  from  one  layer,  when  computing  the  den¬ 
sity  estimate.  For  the  results  shown  in  this  paper,  we  have 
approximated  the  diagonal  values  (h(Li)’s)  with  the  stan¬ 
dard  deviation  of  all  the  training  samples  (Si),  as  is  done  in 
the  layer  refinement  step  (refer  to  Section  2.2  ).  The  layer 
of  outliers  (L0)  also  contributes  samples  to  the  likelihood- 
computation,  when  there  is  a  previously  detected  outlier  in 
the  w -vicinity.  Thus,  there  is  a  competitive  classification 
between  outliers  and  background,  at  the  same  time,  propa¬ 
gating  the  background  layers  throughout  the  video. 
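
The maximum-likelihood assignment over a spatio-temporal neighbourhood can be sketched as follows. This is a simplified illustration, not the authors' code: frames are plain nested lists, each layer's bandwidth is treated as the standard deviation of an isotropic Gaussian kernel (an assumption about how the bandwidth enters the kernel), and the function names are hypothetical. Threshold-based outlier detection is left out here.

```python
import math
from collections import defaultdict


def gather_samples(stack_colors, stack_labels, row, col, w):
    """Collect colors per layer label from the w x w x M neighbourhood."""
    half = w // 2
    samples = defaultdict(list)
    for colors, labels in zip(stack_colors, stack_labels):
        height, width = len(colors), len(colors[0])
        for r in range(max(0, row - half), min(height, row + half + 1)):
            for c in range(max(0, col - half), min(width, col + half + 1)):
                samples[labels[r][c]].append(colors[r][c])
    return samples


def layer_likelihood(y, layer_samples, h):
    """Gaussian kernel density estimate of y from one layer's samples.

    Assumption: h is the standard deviation of an isotropic kernel.
    """
    d = len(y)
    norm = (2.0 * math.pi) ** (d / 2.0) * h ** d
    total = 0.0
    for x in layer_samples:
        sq_dist = sum((a - b) ** 2 for a, b in zip(y, x))
        total += math.exp(-0.5 * sq_dist / (h * h))
    return total / (norm * len(layer_samples))


def ml_assign(y, row, col, stack_colors, stack_labels, bandwidths, w=3):
    """Return the label of the most likely layer among those seen nearby."""
    samples = gather_samples(stack_colors, stack_labels, row, col, w)
    scores = {
        label: layer_likelihood(y, pts, bandwidths[label])
        for label, pts in samples.items()
    }
    return max(scores, key=scores.get)


# Toy stack: two 3x3 frames, a red layer (1) on the left, a blue layer (2) on the right.
RED, BLUE = (200, 10, 10), (20, 20, 220)
frame_colors = [[RED, RED, BLUE] for _ in range(3)]
frame_labels = [[1, 1, 2] for _ in range(3)]
stack_colors = [frame_colors, frame_colors]
stack_labels = [frame_labels, frame_labels]
bandwidths = {1: 5.0, 2: 5.0}

label = ml_assign((198, 12, 11), 1, 1, stack_colors, stack_labels, bandwidths)
```

Only the layers that actually occur in the neighbourhood compete for the pixel. Samples previously labelled as outliers (label 0, say) would take part in the same comparison if they were stored in the stack with a bandwidth of their own.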

3.2.  Threshold  Computation  And  Outlier  Detection 

Depending upon the homogeneity and integrity of the
pixels belonging to the layer, each layer will need
a different threshold to achieve the same “Number of False
Alarms” (NFA) rate. In order to avoid any arbitrariness in
computing these thresholds ($\tau_i$'s), we compute them automatically using the
a-contrario framework [1, 14]. Using samples ($S_i$) from T
that belong to (say) layer $L_i$, we compute the layer probability
$P_{L_i}(x_j)$ of all pixels $x_j$ in T which are already labeled
as belonging to $L_i$. This allows us to compute the probability
that a pixel y belonging to layer $L_i$ has a layer probability $P_{L_i}(y)$
less than a certain threshold (say) $\mu$:

$$P(L_i, \mu) := \Pr\{P_{L_i}(y) < \mu \mid y \in L_i\} \qquad (6)$$

This allows us to say that a pixel z is an $\varepsilon$-meaningful outlier
from the layer $L_i$ if $P(L_i, P_{L_i}(z)) < \varepsilon / n_i$, where $n_i$ is
the number of queried pixels in T belonging to $L_i$. The
a-contrario model assumes that such outliers are uniformly
distributed; hence setting $\varepsilon = 1$, as we do,
ensures that the average number of false detections over
the layer $L_i$ is less than one (for more details refer to, e.g.,
[1, 14]). Thus all thresholds are computed as:

n  =  min/i  s.t.  {P (Li,fi)  <2}  (7) 
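
The outlier test can be sketched with an empirical estimate of the probability in (6). This is an illustrative reading, not the authors' code: it assumes that the probability is estimated as the fraction of the layer's training probabilities lying strictly below the query value, and the function names are hypothetical.

```python
from bisect import bisect_left


def empirical_tail(sorted_probs, mu):
    """Fraction of training layer probabilities strictly below mu."""
    return bisect_left(sorted_probs, mu) / len(sorted_probs)


def is_outlier(sorted_probs, p_z, eps=1.0):
    """Meaningful-outlier test: the tail probability is below eps / n."""
    n = len(sorted_probs)
    return empirical_tail(sorted_probs, p_z) < eps / n


# Layer probabilities of five training pixels already labelled with this layer.
training = sorted([0.30, 0.42, 0.55, 0.61, 0.80])

low = is_outlier(training, 0.10)      # below every training value
typical = is_outlier(training, 0.50)  # two training values lie below it
```

Under this empirical estimate with eps = 1, a pixel is flagged only when no training pixel of the layer has a lower layer probability, so the effective threshold of each layer adapts to how homogeneous that layer is.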

Figure  3  illustrates  the  inverted  winning  maximum- 
likelihood  probabilities  (top-right)  for  all  the  pixels.  The 


Figure 3. The image on the top-right shows the inverted (i.e., subtracted
from 1) maximum-likelihood probabilities corresponding
to the original frame (top-left). When these probabilities are plotted
in a 3D plot (bottom), the moving person (dark brown) is easily
distinguishable.


3D  perspective  view  shows  that  the  moving  person  (indi¬ 
cated  by  dark-brown  color  in  the  bottom  plot)  can  be  easily 
detected.  The  boundaries  of  the  layers  look  slightly  brighter 
(have  low  ML  probabilities)  because  of  color-blending  at 
the  edges.  Further  grouping  of  individual  outliers  as  indi¬ 
cated  in  [  1  ]  can  lead  to  even  more  robust  detections  than 
presented  here. 

The  bandwidths  and  thresholds  can  be  left  unchanged 
or  updated  every  few  frames.  The  color  space  that  we 
use  (same  as  in  [7])  is  robust  to  illumination  changes,  and 
thereby  adaptation  is  not  necessary  if  these  are  the  only  type 
of  changes  expected. 

4.  Results 

The  videos  described  in  this  section  (and  available  at 
www.tc.umn.edu/~patw0007/videolayers) are
outdoor  scenes  of  resolution  160x120.  The  algorithm  was 
implemented  using  C++,  on  a  machine  with  Intel-Pentium 
IV  1.8GHz  processor.  We  used  a  training  stack  of  30 
frames  for  all  the  results,  achieving  a  running  speed  of  1 
frame/second (using completely unoptimized experimental
code). Figure 4 illustrates the performance of our algorithm
in the presence of a moderately dynamic background along
with a lot of camera shake (please see the uploaded original
videos). The detected outliers are very robust to the
dynamics  of  the  background  and  the  camera  motion.  Figure 
5 shows a very challenging situation, produced by highly
perturbed  water  in  the  background.  Our  technique  performs 
very  well  to  distinguish  the  moving  person  (looks  inverted 
due to reflection) from the ripples. For the results in Figures
4 and 5, we have used a w parameter value of 3. Hence
the spatio-temporal training window for each pixel is of size
3 x 3 x 30. Our algorithm does not need very precise registration.
If the uncertainty in the registration computation
is known, it can be factored into the w parameter. As indicated
in Figure 6, in spite of the significant camera panning
we have not used any registration; the panning is instead compensated for by
using a w value of 11, reflecting the increased position
uncertainty. Figure 6 also indicates how well the layers in the
scene  are  propagated  through  the  video  sequence  in  spite  of 
severe  camera  panning. 

5.  Discussion  And  Future  Scope 

In  this  work  we  have  proposed  a  general  framework 
for scene modelling and foreground detection using pixel
clusters (layers). Redundancy in the feature space and spatial
correlation in the image domain are exploited by clustering
pixels into a finite number of layers and modelling the
scene  as  a  union  of  these  layers  rather  than  individual  pix¬ 
els.  The  task  at  hand  then  is  to  assign  any  incoming  pixel 
to  one  of  these  layers  or  as  an  outlier  by  using  a  maximum- 
likelihood  assignment,  which  allows  for  competitive  classi¬ 
fication  between  the  scene  layers  and  the  layer  of  outliers. 
Thresholds  are  chosen  in  a  non-arbitrary  fashion  to  give  ro¬ 
bust  outlier  detections.  The  results  presented  show  very  sat¬ 
isfactory  performance  in  very  difficult  environments. 

We have used only the color information
of the pixels, and results can be further improved
by adding information such as optical flow. Note, however,
that in the case of severe camera motion (as in most of the
result videos shown here), using optical flow may in fact
misguide the background model and generate false alarms.
In  circumstances  where  positional  uncertainty  is  large,  us¬ 
ing  very  high  w -values  is  not  efficient  and  also  degrades  ac¬ 
curacy.  Approximate  (rough)  registration  of  frames  within 
a  certain  error  bound  can  be  utilized  to  optimize  the  perfor¬ 
mance  of  our  algorithm  on  videos  with  severe  panning  or 
camera motion. The run-time can be brought to real time
through code optimization and by using the Improved Fast Gauss
Transform (IFGT), as shown in [17], for fast density computation.
In some detection results a few noisy false alarms
are observed; these can be removed by a “meaningful”
grouping of outliers as proposed in [1]. In the future
we  would  also  like  to  address  the  appearance  of  novel  lay¬ 
ers  (not  seen  previously),  or  appearance  of  foreground  that 
remains static in the scene (e.g., a car arriving in a parking
lot and staying parked for a long time). This can be achieved by
simply  considering  a  “temporally  persistent”  group  of  out¬ 
liers  as  a  completely  new  depth-ordered  background  layer. 
Further  work  in  this  direction  will  be  reported  elsewhere. 
Figure 5. A large amount of water ripples presents a challenging situation for foreground (reflection) detection (see uploaded video). Original
frames are shown on the top (# 0, 47, 62, 89) along with outliers on the bottom.


Figure  6.  Trees  swaying  with  wind  along  with  camera  tilt  and  panning  are  very  difficult  scenarios  for  background  modelling  (see  uploaded 
video).  Outliers  are  shown  in  the  center  row,  while  the  propagation  of  various  layers  ( indicated  by  different  scales  of  gray)  is  shown  in  the 
bottom  row. 


